Kullback-Leibler divergence


Supplementary Materials for "Multi-Agent Meta-Reinforcement Learning" A. Technical Lemmas

Neural Information Processing Systems

From the three-points identity of the Bregman divergence (Lemma 3.1 of [9]), $\mathrm{KL}(x \| y) - \mathrm{KL}(\tilde{x} \| y) = \mathrm{KL}(x \| \tilde{x}) + \langle \ln \tilde{x} - \ln y,\, x - \tilde{x} \rangle$ (12). The first term in (12) can be bounded by $\mathrm{KL}(x \| \tilde{x}) = \dots$ By Hölder's inequality, the second term in (12) is bounded as $\langle \ln \tilde{x} - \ln y,\, x - \tilde{x} \rangle \le \|\ln \tilde{x} - \ln y\| \cdots$ Lemma 5. Consider a block diagonal matrix ... We prove the lemma via induction on $N$. This completes the induction proof. Lemma 6. We introduce one more notation before presenting the proof. This leads us to the initialization-dependent convergence rate of Algorithm 1, which we re-state and prove as follows. In addition, if we initialize the players' policies to be uniform policies, i.e., ... The rest of the proof follows by putting all the aforementioned results together.
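The three-points identity quoted in this excerpt can be checked numerically. The sketch below is only an illustration: it writes `x_tilde` for the second point (the symbol is an assumption, since the excerpt lost its decorations), uses arbitrary example vectors, and pairs the sup-norm with the 1-norm in the Hölder step, which is a common choice but is not confirmed by the excerpt.

```python
import math

def kl(p, q):
    """KL(p || q) for strictly positive vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def three_point_residual(x, x_tilde, y):
    """Left side minus right side of the three-points identity; should be ~0."""
    lhs = kl(x, y) - kl(x_tilde, y)
    inner = sum((math.log(t) - math.log(v)) * (u - t)
                for u, t, v in zip(x, x_tilde, y))
    return lhs - (kl(x, x_tilde) + inner)

def holder_bound(x, x_tilde, y):
    """Return (inner product, Hoelder bound) with an ASSUMED (inf, 1) norm pairing."""
    diffs = [math.log(t) - math.log(v) for t, v in zip(x_tilde, y)]
    inner = sum(d * (u - t) for d, u, t in zip(diffs, x, x_tilde))
    bound = max(abs(d) for d in diffs) * sum(abs(u - t) for u, t in zip(x, x_tilde))
    return inner, bound

# Arbitrary example points on the probability simplex (hypothetical values).
x = [0.5, 0.3, 0.2]
x_tilde = [0.2, 0.5, 0.3]
y = [0.1, 0.6, 0.3]

residual = three_point_residual(x, x_tilde, y)
inner, bound = holder_bound(x, x_tilde, y)
print(residual, inner, bound)
```

The residual is zero up to floating-point error because the identity is purely algebraic; the inequality holds for any conjugate pair of norms, of which (∞, 1) is one instance.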


The Learnability of In-Context Learning

Neural Information Processing Systems

Our theoretical analysis reveals that in this setting, in-context learning is more about identifying the task than about learning it, a result which is in line with a series of recent empirical findings.


Bipartite Stochastic Block Models with Tiny Clusters

Stefan Neumann

Neural Information Processing Systems

Discovering clusters in bipartite graphs has been researched in many different settings. However, most of these algorithms are heuristics and do not provide theoretical guarantees for the quality of their results.




On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions

Neural Information Processing Systems

Kullback-Leibler (KL) divergence is one of the most important measures to calculate the difference between probability distributions. In this paper, we theoretically study several properties of KL divergence between multivariate Gaussian distributions.
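For reference, the KL divergence between two k-dimensional Gaussians has the well-known closed form KL(N(μ1, Σ1) ‖ N(μ2, Σ2)) = ½ [tr(Σ2⁻¹Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − k + ln(det Σ2 / det Σ1)]. The sketch below evaluates it in the special case of diagonal covariances, a simplification chosen here to stay within the standard library; the function name and the example values are illustrative and do not come from the paper.

```python
import math

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL(N(mu1, diag(var1)) || N(mu2, diag(var2))) for positive variances.

    With diagonal covariances the closed form decomposes into a sum
    of independent one-dimensional terms.
    """
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        total += v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + math.log(v2 / v1)
    return 0.5 * total

# Hypothetical example: the divergence is not symmetric in its arguments.
forward = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [2.0, 0.5])
backward = kl_diag_gaussians([1.0, 0.0], [2.0, 0.5], [0.0, 0.0], [1.0, 1.0])
print(forward, backward)
```

The two printed values differ, which illustrates the asymmetry that makes properties of this divergence worth studying separately in each direction.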


Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation

Neural Information Processing Systems

Since the pioneering work of Hinton et al., knowledge distillation based on Kullback-Leibler Divergence (KL-Div) has been predominant, and recently its variants have achieved compelling performance. However, KL-Div only compares probabilities of the corresponding category between the teacher and student while lacking a mechanism for cross-category comparison. Besides, KL-Div is problematic when applied to intermediate layers, as it cannot handle non-overlapping distributions and is unaware of the geometry of the underlying manifold. To address these downsides, we propose a methodology of Wasserstein Distance (WD) based knowledge distillation. Specifically, we propose a logit distillation method called WKD-L based on discrete WD, which performs cross-category comparison of probabilities and thus can explicitly leverage rich interrelations among categories. Moreover, we introduce a feature distillation method called WKD-F, which uses a parametric method for modeling feature distributions and adopts continuous WD for transferring knowledge from intermediate layers. Comprehensive evaluations on image classification and object detection show that (1) for logit distillation, WKD-L outperforms very strong KL-Div variants; (2) for feature distillation, WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors.
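The point about cross-category comparison can be illustrated with a toy example. The sketch below is not WKD-L: it assumes, purely for illustration, that categories sit on a line with ground cost |i − j|, so that the 1-D Wasserstein distance reduces to a sum of CDF differences. The distributions are hypothetical.

```python
import math
from itertools import accumulate

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def wasserstein_1d(p, q):
    """W1 between distributions on categories 0..K-1.

    ASSUMPTION: ground cost |i - j| between category indices, which
    makes W1 equal to the summed absolute difference of the CDFs.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

teacher = [0.7, 0.1, 0.1, 0.1]
student_near = [0.1, 0.7, 0.1, 0.1]  # mass misplaced on an adjacent category
student_far = [0.1, 0.1, 0.1, 0.7]   # mass misplaced on a distant category

print(kl(teacher, student_near), kl(teacher, student_far))
print(wasserstein_1d(teacher, student_near), wasserstein_1d(teacher, student_far))
```

KL assigns both students the same loss because it only matches probabilities category by category, whereas the Wasserstein distance penalizes the distant mistake more. In practice the ground cost would encode learned relations among categories rather than an index ordering.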


Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence

Neural Information Processing Systems

Existing rotated object detectors are mostly inherited from the horizontal detection paradigm, as the latter has evolved into a well-developed area. However, these detectors struggle to perform well in high-precision detection due to the limitations of current regression loss design, especially for objects with large aspect ratios. Taking the perspective that horizontal detection is a special case of rotated object detection, in this paper, we are motivated to change the design of the rotation regression loss from an induction paradigm to a deduction methodology, in terms of the relation between rotation and horizontal detection. We show that one essential challenge is how to modulate the coupled parameters in the rotation regression loss, so that the estimated parameters can influence each other during the dynamic joint optimization, in an adaptive and synergetic way. Specifically, we first convert the rotated bounding box into a 2-D Gaussian distribution, and then calculate the Kullback-Leibler Divergence (KLD) between the Gaussian distributions as the regression loss.
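A minimal sketch of the two steps named in the abstract follows. The conversion used here, mean at the box center and covariance R diag(w²/4, h²/4) Rᵀ with R the rotation by θ, is a commonly used convention and is an assumption; the abstract does not spell it out. The code computes only the raw KL divergence, not any transformed loss the paper may build on top of it, and all function names are illustrative.

```python
import math

def box_to_gaussian(cx, cy, w, h, theta):
    """Map a rotated box to (mean, covariance) of a 2-D Gaussian.

    ASSUMED convention: Sigma = R diag((w/2)^2, (h/2)^2) R^T.
    The covariance is returned as (sxx, sxy, syy).
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2) ** 2, (h / 2) ** 2
    return (cx, cy), (a * c * c + b * s * s, (a - b) * c * s, a * s * s + b * c * c)

def kl_2d(g1, g2):
    """KL(N1 || N2) for 2-D Gaussians given as (mean, (sxx, sxy, syy))."""
    (m1x, m1y), (a1, b1, c1) = g1
    (m2x, m2y), (a2, b2, c2) = g2
    det1 = a1 * c1 - b1 * b1
    det2 = a2 * c2 - b2 * b2
    # trace(Sigma2^{-1} Sigma1) using the explicit 2x2 inverse
    trace = (c2 * a1 - 2 * b2 * b1 + a2 * c1) / det2
    dx, dy = m2x - m1x, m2y - m1y
    maha = (c2 * dx * dx - 2 * b2 * dx * dy + a2 * dy * dy) / det2
    return 0.5 * (trace + maha - 2.0 + math.log(det2 / det1))

pred = box_to_gaussian(10.0, 10.0, 8.0, 2.0, 0.1)
target = box_to_gaussian(10.5, 10.0, 8.0, 2.0, 0.0)
swapped = box_to_gaussian(10.5, 10.0, 2.0, 8.0, math.pi / 2)  # same box, other parameterization
print(kl_2d(pred, target), kl_2d(pred, swapped))
```

Under this convention, a box described as (w, h, θ) and the same box described as (h, w, θ + π/2) map to the same Gaussian, so the divergence does not depend on which parameterization is used.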


Chicken Swarm Kernel Particle Filter: A Structured Rejuvenation Approach with KLD-Efficient Sampling

Tian, Hangshuo

arXiv.org Artificial Intelligence

Particle filters (PFs) are often combined with swarm intelligence (SI) algorithms, such as Chicken Swarm Optimization (CSO), for particle rejuvenation. Separately, Kullback-Leibler divergence (KLD) sampling is a common strategy for adaptively sizing the particle set. However, the theoretical interaction between SI-based rejuvenation kernels and KLD-based adaptive sampling is not yet fully understood. This paper investigates this specific interaction. We analyze, under a simplified modeling framework, the effect of the CSO rejuvenation step on the particle set distribution. We propose that the fitness-driven updates inherent in CSO can be approximated as a form of mean-square contraction. This contraction tends to produce a particle distribution that is more concentrated than that of a baseline PF, or in mathematical terms, a distribution that is plausibly more "peaked" in a majorization sense. By applying Karamata's inequality to the concave function that governs the expected bin occupancy in KLD-sampling, our analysis suggests a connection: under the stated assumptions, the CSO-enhanced PF (CPF) is expected to require a lower expected particle count than the standard PF to satisfy the same statistical error bound. The goal of this study is not to provide a fully general proof, but rather to offer a tractable theoretical framework that helps to interpret the computational efficiency empirically observed when combining these techniques, and to provide a starting point for designing more efficient adaptive filters.
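The argument can be made concrete with a small sketch. It rests on two assumptions that are not taken from the abstract: the expected number of occupied bins after n draws is modeled as the sum over bins of 1 − (1 − p_j)^n, which is concave in each p_j, and the particle count for k occupied bins follows the commonly cited KLD-sampling bound based on the Wilson-Hilferty chi-square approximation. Function names and the example distributions are hypothetical.

```python
import math
from statistics import NormalDist

def expected_occupied_bins(probs, n):
    """Expected number of bins hit at least once by n independent draws."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)

def kld_sample_size(k, epsilon, delta):
    """Particle count so that the KL error stays below epsilon with
    probability 1 - delta, given k occupied bins (ASSUMED standard bound)."""
    if k < 2:
        return 1
    z = NormalDist().inv_cdf(1.0 - delta)  # upper standard normal quantile
    a = 2.0 / (9.0 * (k - 1))
    return math.ceil((k - 1) / (2.0 * epsilon) * (1.0 - a + math.sqrt(a) * z) ** 3)

uniform = [0.1] * 10            # spread-out particle distribution
peaked = [0.91] + [0.01] * 9    # more concentrated; majorizes the uniform one

bins_uniform = expected_occupied_bins(uniform, 20)
bins_peaked = expected_occupied_bins(peaked, 20)
print(bins_uniform, bins_peaked)
print(kld_sample_size(10, 0.05, 0.01), kld_sample_size(50, 0.05, 0.01))
```

The concentrated distribution occupies fewer bins in expectation, and the bound grows with the number of occupied bins, which matches the direction of the claim about a lower expected particle count.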


Estimation of discrete distributions with high probability under $χ^2$-divergence

Louati, Sirine

arXiv.org Machine Learning

We investigate the high-probability estimation of discrete distributions from an i.i.d. sample under $χ^2$-divergence loss. Although the minimax risk in expectation is well understood, its high-probability counterpart remains largely unexplored. We provide sharp upper and lower bounds for the classical Laplace estimator, showing that it achieves optimal performance among estimators that do not rely on the confidence level. We further characterize the minimax high-probability risk for any estimator and demonstrate that it can be attained through a simple smoothing strategy. Our analysis highlights an intrinsic separation between asymptotic and non-asymptotic guarantees, with the latter suffering from an unavoidable overhead. This work sharpens existing guarantees and advances the theoretical understanding of divergence-based estimation.
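As a small illustration of the objects involved, the sketch below implements the classical Laplace (add-one) estimator and a $χ^2$-divergence with the estimate in the denominator. The direction of the divergence is an assumption, since the abstract does not state it, and the counts and true distribution are made-up values.

```python
def laplace_estimate(counts):
    """Add-one smoothing: (n_i + 1) / (n + d) for d categories and n samples."""
    n, d = sum(counts), len(counts)
    return [(c + 1) / (n + d) for c in counts]

def chi2_divergence(p, q):
    """chi^2(p || q) = sum_i (p_i - q_i)^2 / q_i; requires every q_i > 0."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

counts = [3, 0, 1]            # hypothetical sample of size 4 over 3 categories
p_true = [0.6, 0.1, 0.3]      # hypothetical true distribution
p_hat = laplace_estimate(counts)
loss = chi2_divergence(p_true, p_hat)
print(p_hat, loss)
```

With this direction of the divergence, the unsmoothed empirical frequencies would put zero mass on the second category and make the loss infinite, which is one reason smoothing matters in this setting.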